flowchart LR
    EDA["<div style='line-height:1.0;'>Exploratory<br>Data<br>Analysis</div>"]
    Clean["<div style='line-height:1.0;'>Clean<br>Original<br>Data</div>"]
    BuildModel["<div style='line-height:1.0;'>Build<br>RNN<br>Model</div>"]
    Train["<div style='line-height:1.0;'>Train<br>Model</div>"]
    Tune["<div style='line-height:1.0;'>Tune<br>Hyperparameters</div>"]
    OutputFinal["<div style='line-height:1.0;'>Final<br>Models</div>"]
    Submit["<div style='line-height:1.0;'>Submit<br>Results</div>"]
    EDA --> Clean --> BuildModel --> Train --> Tune --> OutputFinal --> Submit
    Tune --> Train
DTSA 5511 Introduction to Machine Learning: Deep Learning
Week 4: Natural Language Processing with Disaster Tweets
1 Problem Description
This project builds a Recurrent Neural Network model for the Natural Language Processing with Disaster Tweets competition, hosted on Kaggle (Howard et al. 2019), with the objective of developing a machine learning model that can accurately classify tweets as disaster-related or not. The dataset consists of 10,000 manually labeled tweets, creating a binary classification task where tweets are labeled 1 for disaster-related content and 0 for non-disaster-related content.
To achieve this goal, this project will leverage PyTorch (Ansel et al. 2024) to design and implement a Recurrent Neural Network (RNN), a neural architecture well-suited for processing sequential text data. The trained model will generate predictions that will be submitted to Kaggle for evaluation.
This project will address the following research questions:
| Research Area | Question |
|---|---|
| Data Preparation | How should text data be preprocessed to maximize model performance? |
| Model Building | How do we implement an RNN model in PyTorch? |
| Hyperparameter Tuning | What static hyperparameters should be defined and what dynamic hyperparameters should be tuned? |
| Model Performance | What performance metrics should be used? How do the models perform during training, validation, and testing? |
| Improvement Strategies | What methods can be used to further enhance model performance? |
Beyond answering these questions, the project aims to address the technical challenges related to RNN models, including mitigating overfitting, handling exploding gradients, and balancing model complexity with prediction accuracy.
The workflow for this research is summarized in Figure 1. The process begins with exploratory data analysis and preprocessing, followed by model training. Hyperparameter tuning is iteratively applied to refine model performance, culminating in the development of final models for Kaggle submission.
2 Exploratory Data Analysis
For the defined binary classification task, the training and test data supply a tweet, location, keyword, an id, and (for the training set) a class. Each training tweet is labeled as either disaster-related (1) or not (0). The primary input feature is text (the original tweet content), supplemented by the keyword and location fields.
2.1 Training Data Columns and Types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 7613 non-null int64
1 keyword 7552 non-null object
2 location 5080 non-null object
3 text 7613 non-null object
4 target 7613 non-null int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
Table 2 details the available data. Of note are the large number of missing values in the location column and a small number of missing values in the keyword column. id and target are integers while the other three columns are text.
2.2 Training Data Sample
| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 31 | 48 | ablaze | Birmingham | @bbcmtd Wholesale Markets ablaze http://t.co/l... | 1 |
| 32 | 49 | ablaze | Est. September 2012 - Bristol | We always try to bring the heavy. #metal #RT h... | 0 |
| 33 | 50 | ablaze | AFRICA | #AFRICANBAZE: Breaking news:Nigeria flag set a... | 1 |
| 34 | 52 | ablaze | Philadelphia, PA | Crying out for more! Set me ablaze | 0 |
| 35 | 53 | ablaze | London, UK | On plus side LOOK AT THE SKY LAST NIGHT IT WAS... | 0 |
Table 3 outputs the contents of a subset of the data. Data in the keyword column appears to be somewhat standardized while data in the location and text columns appear to be original inputs from the user.
2.3 Distribution of Target Values
Binary classification training data ideally contains a roughly equal number of positive and negative examples. This is checked by counting the occurrences of each target value.
Code
target Values in Training Data
Figure 2 shows an unequal number of examples in each class, with negative values more common. This imbalance is important and must be accounted for during model training and validation.
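The counting step can be sketched as follows. This is a minimal illustration with hypothetical labels; the real calculation runs on the target column of the training DataFrame.

```python
from collections import Counter

# Hypothetical stand-in for train_df["target"]; the real column has 7613 labels.
labels = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

counts = Counter(labels)
positive_ratio = counts[1] / len(labels)
print(counts)           # frequency of each class
print(positive_ratio)   # 0.4
```

With an imbalance like this, stratified splitting or a class-weighted loss are common mitigations.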
2.4 Sample of Positive Tweets
| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
In Table 4, positive tweets appear to have some relation to the disaster they are describing. The content appears to have multiple complex words and hashtags.
2.5 Sample of Negative Tweets
| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 15 | 23 | NaN | NaN | What's up man? | 0 |
| 16 | 24 | NaN | NaN | I love fruits | 0 |
| 17 | 25 | NaN | NaN | Summer is lovely | 0 |
| 18 | 26 | NaN | NaN | My car is so fast | 0 |
| 19 | 28 | NaN | NaN | What a goooooooaaaaaal!!!!!! | 0 |
In Table 5, negative tweets have content that appears irrelevant to a disaster. This content appears relatively generic and not specific to any event or location.
2.6 Word Count
To identify whether data cleaning is necessary, the content of the tweets (text column) is visualized below. Each row is split into a list of words by whitespace, and the combined list of words is counted.
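A minimal sketch of this whitespace-split-and-count approach, using toy tweets for illustration:

```python
from collections import Counter

# Toy tweets standing in for the text column
tweets = [
    "Forest fire near La Ronge",
    "fire evacuation order in effect",
]

# Split each row on whitespace and flatten into one combined list
words = [w for tweet in tweets for w in tweet.split()]
word_counts = Counter(words)
print(word_counts.most_common(3))
```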
Code
import numpy as np
from collections import Counter
def plot_word_horizontal_bar_chart(df, column, top_n=10, figsize=None):
    """
    Plot a horizontal bar chart of the most common words in a DataFrame column.
    :param df: The input DataFrame.
    :param column: The name of the column to analyze.
    :param top_n: Number of most common words to display.
    :param figsize: Optional figure size; defaults to (3, 6).
    """
    # Count word frequencies
    words = dataframe_to_word_list(df, column)
    word_counts = Counter(words)
    most_common = word_counts.most_common(top_n)
    # Split words and their counts
    labels, counts = zip(*most_common)
    # Plot the horizontal bar chart
    if figsize is None:
        figsize = (3, 6)
    plt.figure(figsize=figsize)
    plt.barh(labels, counts)
    plt.xlabel("Count")
    plt.title(f"Top {top_n} Words in {column}")
    plt.gca().invert_yaxis()  # Display the highest count at the top
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

def dataframe_to_word_list(df, text_column):
    """
    Convert a DataFrame column of text into a list of words.
    :param df: The input DataFrame.
    :param text_column: The name of the column containing text data.
    :return: A list of words.
    """
    # Tokenize each row into words and flatten into a single list
    words = df[text_column].str.split().explode().tolist()
    # Drop NaN entries (rows with missing text) and empty strings
    return [word for word in words if isinstance(word, str) and word]
Code
text Column
location Column
keyword Column
The word count histograms in Figure 3 reveal a mixture of relevant and possibly unnecessary words and characters in each column. Additionally, some characters may be removed to clean up the input to the model.
3 Data Cleaning
Based on the word count visualizations in Figure 3, it is evident that removing common stop words may affect the model's performance. This process will be implemented using the NLTK Python package (Bird, Klein, and Loper 2009). To allow the data cleaning level to act as a hyperparameter and to track different preprocessing configurations, the cleaned datasets will be labeled with an a<count> suffix, where the first cleaned dataset is designated a1.
3.1 Data Level a1: Removing Stop Words
Code
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
def filter_stop_words(df, column):
def filter_stop_words(word_list):
"""Filter stop words from a list of words."""
words = str(word_list).split()
words = [word for word in words if word != "nan"]
return " ".join([word for word in words if word.lower() not in stop_words and word])
# Apply the filtering function to the specified column
df[column] = df[column].apply(filter_stop_words)
return df
train_df_a1 = train_df.copy()
train_df_a1 = filter_stop_words(train_df_a1, 'text')
train_df_a1 = filter_stop_words(train_df_a1, 'location')
train_df_a1 = filter_stop_words(train_df_a1, 'keyword')
3.2 Data Level a2: Removing Unnecessary Characters
It may be beneficial to further clean the data. We apply some high-level techniques to normalize the input data.
Code
import re
def clean_df_text_column(df, column):
def clean_row(tweet):
words = tweet.split()
cleaned_words = []
for word in words:
# Remove URLs
# word = re.sub(r'http\S+|www\S+', '[URL]', word)
# Replace user mentions (@username) with a placeholder
# word = re.sub(r'@\w+', '[USER]', word)
# Remove hashtags but keep the word (e.g., "#earthquake" → "earthquake")
# word = re.sub(r'#(\w+)', r'\1', word)
# Remove unwanted characters (e.g., punctuation)
word = re.sub(r'[^\w\s]', '', word)
# Remove dashes
word = re.sub('-', '', word)
# Remove extra spaces (if any remain)
word = word.strip()
# Add the cleaned word to the list if it's not empty
if word and len(word) > 1:
cleaned_words.append(word.lower())
return " ".join(cleaned_words)
df[column] = df[column].apply(clean_row)
return df
train_df_a2 = train_df_a1.copy()
train_df_a2 = clean_df_text_column(train_df_a2, 'text')
train_df_a2 = clean_df_text_column(train_df_a2, 'location')
train_df_a2 = clean_df_text_column(train_df_a2, 'keyword')
3.3 Final Cleaning Results
Final cleaning results from data level a2 are detailed below.
3.3.1 Text Content
To determine if there are differences between cleaned positive and negative tweets, a sample of randomly selected tweets from each class is output below.
3.3.1.1 Positive Tweets
Code
| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 4329 | 6148 | hijack | nigeria | criminals hijack lorries buses arrested enugu ... | 1 |
| 5085 | 7252 | nuclear20disaster | netherlands | fukushimatepco fukushima nuclear disaster incr... | 1 |
| 3871 | 5503 | flames | santo domingo alma rosa | soloquiero maryland mansion fire killed caused... | 1 |
| 6221 | 8880 | smoke | ktx | get smoke shit peace | 1 |
| 5710 | 8147 | rescuers | iminchina | video were picking bodies water rescuers searc... | 1 |
3.3.1.2 Negative Tweets
Code
| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 5997 | 8562 | screams | gladiator û860û757û | casually phone jasmine cries screams spider | 0 |
| 338 | 485 | armageddon | flightcity uk | official vid gt doublecups gtgt httpstcolfkmtz... | 0 |
| 5682 | 8109 | rescued | bournemouth | finnish hip hop pioneer paleface rescued drift... | 0 |
| 2598 | 3727 | destroyed | waco texas | always felt like namekians black people felt p... | 0 |
| 3510 | 5017 | eyewitness | rhode island | wpri 12 eyewitness news rhode island set moder... | 0 |
In the randomly selected tweets there are some differences between the two classes, but after cleaning the differences are not readily apparent from a content perspective. Both positive and negative tweets contain some text that is not readily comprehensible.
3.3.2 Visualizations
Code
def count_unique_words(input_df, column):
# Create a set to store unique words
unique_words = set()
# Iterate through each row in the column
for text in input_df[column]:
if isinstance(text, str): # Ensure the entry is a string
words = text.split() # Split into words and normalize to lowercase
unique_words.update(words) # Add words to the set
# Return the size of the set
    return len(unique_words)
Code
results = [
{
"class": "original",
"column": "text",
"count": count_unique_words(train_df, 'text')
},
{
"class": "a1",
"column": "text",
"count": count_unique_words(train_df_a1, 'text')
},
{
"class": "a2",
"column": "text",
"count": count_unique_words(train_df_a2, 'text')
},
{
"class": "original",
"column": "keyword",
"count": count_unique_words(train_df, 'keyword')
},
{
"class": "a1",
"column": "keyword",
"count": count_unique_words(train_df_a1, 'keyword')
},
{
"class": "a2",
"column": "keyword",
"count": count_unique_words(train_df_a2, 'keyword')
},
{
"class": "original",
"column": "location",
"count": count_unique_words(train_df, 'location')
},
{
"class": "a1",
"column": "location",
"count": count_unique_words(train_df_a1, 'location')
},
{
"class": "a2",
"column": "location",
"count": count_unique_words(train_df_a2, 'location')
},
]
results_df = pd.DataFrame(results)
results_df = results_df.rename({
"class": "Data Level",
"count": "Count",
"column": "Data Column",
}, axis="columns")
def plot_count_hist(input_df, column):
plt.figure(figsize=(4, 3))
# Create the barplot
ax = sns.barplot(
data=input_df.loc[input_df["Data Column"] == column],
y="Count",
x="Data Column",
hue="Data Level",
)
# Customize the legend position to appear below the plot
plt.legend(
title="Data Level", # Optional: Add a title to the legend
loc="upper center", # Center the legend horizontally
bbox_to_anchor=(0.5, -0.25), # Adjust the vertical position below the plot
ncol=3, # Display the legend in two columns (optional for compactness)
frameon=False, # Remove the legend border (optional)
)
plt.xlabel(None)
plt.tight_layout() # Adjust layout to prevent overlap
    plt.show()
In Figure 6 the number of unique values decreases at each data processing step. For the text column, removing stop words slightly reduces the unique value count, while the text cleaning removes a significant number of values. location in Figure 7 follows a similar pattern, but the number of words removed by stop word cleaning is higher than for text. keyword in Figure 8 shows no change from cleaning, suggesting that its format is constrained at input time and the original data is already normalized.
| Data Level | Description |
|---|---|
| Original | Original data without modifications |
| a1 | Stop words removed |
| a2 | a1 & lowered, removed white space, remove length less than 2, remove punctuation |
We will now process the training and test data using the above functions and save them to parquet files as input for testing:
Code
from pathlib import Path
data_path = Path("../data/preprocessed").resolve()
data_path.mkdir(exist_ok=True, parents=True)
train_data_path = Path(data_path, "train")
train_data_path.mkdir(exist_ok=True)
train_raw_filename = Path(train_data_path, "train_raw.parquet")
train_a1_filename = Path(train_data_path, "train_a1.parquet")
train_a2_filename = Path(train_data_path, "train_a2.parquet")
train_df.to_parquet(train_raw_filename)
train_df_a1.to_parquet(train_a1_filename)
train_df_a2.to_parquet(train_a2_filename)
test_data_path = Path(data_path, "test")
test_data_path.mkdir(exist_ok=True)
test_raw_filename = Path(test_data_path, "test_raw.parquet")
test_a1_filename = Path(test_data_path, "test_a1.parquet")
test_a2_filename = Path(test_data_path, "test_a2.parquet")
test_df = pd.read_csv("../data/test.csv")
test_df_a1 = test_df.copy()
test_df_a1 = filter_stop_words(test_df_a1, 'text')
test_df_a1 = filter_stop_words(test_df_a1, 'location')
test_df_a1 = filter_stop_words(test_df_a1, 'keyword')
test_df_a2 = test_df_a1.copy()
test_df_a2 = clean_df_text_column(test_df_a2, 'text')
test_df_a2 = clean_df_text_column(test_df_a2, 'location')
test_df_a2 = clean_df_text_column(test_df_a2, 'keyword')
test_df.to_parquet(test_raw_filename)
test_df_a1.to_parquet(test_a1_filename)
test_df_a2.to_parquet(test_a2_filename)
3.4 Tokenization
Another step in preparing the data is converting the text into a numerical representation that a neural network can work with. This can be implemented using multiple methods; one popular option is the Transformers Python library developed by Wolf et al. (2022). As detailed by AkaraAsai on GitHub, there are many options for pretrained tokenizers. For this project we will use the bert-base-uncased and bert-base-cased tokenizers via the PreTrainedTokenizer class in the transformers library.
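Before the full preprocessing function below, the encode/truncate/pad/mask idea can be sketched with a toy vocabulary. This is illustrative only: the vocabulary and word ids here are made up and this is not the real bert-base-cased tokenizer, though the 101/102/0 ids mirror BERT's [CLS]/[SEP]/[PAD] conventions.

```python
# Toy sketch of tokenization with truncation, padding, and an attention mask.
# The vocabulary and ids are hypothetical; the project uses transformers'
# bert-base-cased tokenizer instead.
PAD, UNK, CLS, SEP = 0, 1, 101, 102
vocab = {"forest": 5, "fire": 6, "near": 7}

def toy_encode(text, max_length):
    ids = [CLS] + [vocab.get(w, UNK) for w in text.lower().split()] + [SEP]
    ids = ids[:max_length]                      # truncate
    mask = [1] * len(ids)                       # real tokens attend
    pad = max_length - len(ids)
    return ids + [PAD] * pad, mask + [0] * pad  # padding is masked out

tokens, attention = toy_encode("Forest fire near", max_length=6)
print(tokens)     # [101, 5, 6, 7, 102, 0]
print(attention)  # [1, 1, 1, 1, 1, 0]
```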
def preprocess_dataframe(
df: pd.DataFrame,
tokenizer: PreTrainedTokenizer,
text_max_length: int,
keyword_max_length: int,
location_max_length: int,
):
ids, tokens, attentions, targets = [], [], [], []
max_length = text_max_length + keyword_max_length + location_max_length
df["keyword"] = df["keyword"].fillna("")
df["location"] = df["location"].fillna("")
for _, row in df.iterrows():
# Tokenize each component and add special tokens to indicate type
text_tokens = tokenizer.encode(
row["text"],
add_special_tokens=True,
truncation=True,
max_length=text_max_length,
padding="max_length",
return_tensors="pt",
).tolist()[0]
keyword_tokens = tokenizer.encode(
row["keyword"],
add_special_tokens=True,
truncation=True,
max_length=keyword_max_length,
padding="max_length",
return_tensors="pt",
).tolist()[0]
location_tokens = tokenizer.encode(
row["location"],
add_special_tokens=True,
truncation=True,
max_length=location_max_length,
padding="max_length",
return_tensors="pt",
).tolist()[0]
# Combine tokens
combined_tokens = text_tokens + keyword_tokens + location_tokens
# Create initial attention mask with 1s for all tokens
attention_mask = [1] * len(combined_tokens)
# Pad tokens and attention mask to max_length
padding_length = max_length - len(combined_tokens)
combined_tokens += [tokenizer.pad_token_id] * padding_length
attention_mask += [0] * padding_length
# Update attention mask to set positions with padding tokens to 0
attention_mask = [
0 if token == tokenizer.pad_token_id else mask
for token, mask in zip(combined_tokens, attention_mask)
]
# Collect processed data
ids.append(row["id"])
tokens.append(combined_tokens)
attentions.append(attention_mask)
try:
targets.append(row["target"])
except KeyError:
continue
# Return targets if they exist
if len(targets) > 0:
return pd.DataFrame(
{"id": ids, "tokens": tokens, "attention": attentions, "target": targets}
)
else:
        return pd.DataFrame({"id": ids, "tokens": tokens, "attention": attentions})
4 Recurrent Neural Network Model
To solve the disaster tweet classification problem, an RNN model was implemented using PyTorch (Ansel et al. 2024). This model is specifically designed to process and classify sequential text data while also leveraging additional features, such as keyword and location, to improve accuracy. The architecture has the following key components and characteristics:
4.1 Embedding Layers
The model uses three separate embedding layers (text_embedding, keyword_embedding, and location_embedding) to convert the categorical input (text, keyword, and location) into dense, low-dimensional vectors.
- Embedding Dimension: The size of these dense representations is a tunable hyperparameter (embedding_dim, default: 128).
- Padding Index: A padding index of 0 ensures uniform sequence lengths for inputs with varying sizes.
4.2 Recurrent Layer (LSTM)
The embeddings are concatenated into a combined representation, which serves as input to an LSTM (Long Short-Term Memory) layer, first described by Hochreiter and Schmidhuber (1997). For this project the input dimension can be tuned as a hyperparameter:
- Input Dimensions: The concatenated embeddings have a dimensionality of \text{Embedding Dimension} \times 3 as all three embeddings are combined.
- Hidden Dimension: The LSTM layer processes the input and outputs a hidden state with size hidden_dim. To reduce the number of hyperparameters this value is fixed at \text{Input Dimension} \times 2.
- Batch Processing: The batch_first=True argument ensures the input tensors are structured as (batch size, sequence length, feature size).
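The dimension bookkeeping above amounts to simple arithmetic, shown here with the default embedding size:

```python
# Input/hidden sizes implied by the relations stated above
embedding_dim = 128            # tunable hyperparameter (default)
input_dim = embedding_dim * 3  # text + keyword + location embeddings concatenated
hidden_dim = input_dim * 2     # fixed relation used in this project
print(input_dim, hidden_dim)   # 384 768
```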
4.3 Fully Connected Layer
The final hidden state of the LSTM, corresponding to the last time step, is passed through a fully connected (fc) layer to reduce the dimensionality to the output size of 1.
4.4 Sigmoid Activation
The output of the fully connected layer is passed through a sigmoid activation function, which scales the predictions to a range of [0, 1]. These outputs represent the probability of a tweet being disaster-related. For validation or testing, these probabilities are thresholded (e.g., \geq 0.5) to classify the output as either 1 or 0.
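The probability-then-threshold step can be sketched in plain Python, using hypothetical logits for illustration:

```python
import math

def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw outputs of the fully connected layer
logits = [-2.0, 0.3, 4.1]
probs = [sigmoid(z) for z in logits]
labels = [1 if p >= 0.5 else 0 for p in probs]
print([round(p, 3) for p in probs])  # [0.119, 0.574, 0.984]
print(labels)                        # [0, 1, 1]
```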
4.5 RNNWithMultiInput Class Definition
import torch
import torch.nn as nn
class RNNWithMultiInput(nn.Module):
def __init__(
self,
vocab_size,
use_attention=False,
embedding_dim=128,
hidden_dim=256,
output_dim=1,
):
super(RNNWithMultiInput, self).__init__()
self.use_attention = use_attention
self.text_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
self.keyword_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
self.location_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
self.lstm = nn.LSTM(
embedding_dim * 3,
hidden_dim,
batch_first=True,
)
self.fc = nn.Linear(hidden_dim, output_dim)
self.sigmoid = nn.Sigmoid()
def forward(
self,
text_input_ids,
text_attention_mask,
keyword_input_ids,
keyword_attention_mask,
location_input_ids,
location_attention_mask,
):
# Embedding layers
text_emb = self.text_embedding(text_input_ids)
keyword_emb = self.keyword_embedding(keyword_input_ids)
location_emb = self.location_embedding(location_input_ids)
if self.use_attention is True:
text_emb = text_emb * text_attention_mask.unsqueeze(-1)
keyword_emb = keyword_emb * keyword_attention_mask.unsqueeze(-1)
location_emb = location_emb * location_attention_mask.unsqueeze(-1)
# Combine embeddings
combined_emb = torch.cat((text_emb, keyword_emb, location_emb), dim=2)
# Pass through LSTM
lstm_out, _ = self.lstm(combined_emb)
last_hidden_state = lstm_out[:, -1, :]
# Fully connected layer
logits = self.fc(last_hidden_state)
        return self.sigmoid(logits).squeeze()
4.6 Hyperparameter Tuning
Key hyperparameters for optimization include:
- Embedding Dimension (embedding_dim): Controls the size of the feature space for text representation.
- Hidden Dimension (hidden_dim): Determines the capacity of the LSTM layer to capture sequential patterns.
- Batch Size and Learning Rate: While not part of the architecture, these parameters significantly influence training efficiency and model performance.
4.6.1 Attention Mechanism Hyperparameter
The model includes an optional attention mechanism (enabled by the use_attention flag). If enabled, attention masks are applied to the embeddings to emphasize relevant parts of the input sequences while ignoring padded elements.
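A conceptual sketch of what the masking does, with plain lists standing in for tensors (the model itself does this with a broadcasted multiply on the embedding tensors):

```python
# Zero out the embedding vectors at padded positions: positions with mask 0
# contribute nothing to the sequence representation.
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # (seq_len=3, emb_dim=2)
attention_mask = [1, 1, 0]                          # last position is padding

masked = [[value * m for value in vector]
          for vector, m in zip(embeddings, attention_mask)]
print(masked)  # [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
```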
4.7 Rationale for Architecture
This architecture is well-suited for this problem because:
- The LSTM layer efficiently captures sequential dependencies in textual data, which is critical for understanding the context within tweets.
- By incorporating separate embeddings for keyword and location, the model leverages additional information beyond the tweet text, potentially improving classification accuracy.
- The flexibility to enable or disable attention mechanisms provides adaptability for datasets with varying levels of noise or irrelevant data.
5 Training
The training process for this project begins with an 80/20 train-test split of the dataset, ensuring a robust and reliable evaluation of model performance. A critical consideration in training a neural network is the selection of an appropriate validation metric. Given the binary classification nature of the task—where tweets are labeled as disaster-related (1) or non-disaster-related (0)—traditional metrics such as accuracy, precision, recall, and F1 score must be carefully evaluated in the context of class imbalance.
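The split itself can be sketched as a seeded shuffle of row indices. This is a minimal stand-in for the actual split used in training; the seed and index count are arbitrary.

```python
import random

# 80/20 split over hypothetical row indices
indices = list(range(100))
random.Random(42).shuffle(indices)   # seeded for repeatability
split = int(len(indices) * 0.8)
train_idx, val_idx = indices[:split], indices[split:]
print(len(train_idx), len(val_idx))  # 80 20
```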
For this dataset, F1 score is chosen as the primary evaluation metric. This decision is driven by the inherent class imbalance in the disaster-related tweets, where false positives (non-disaster tweets incorrectly classified as disasters) and false negatives (disaster tweets missed by the model) both carry significant consequences. The F1 score, being the harmonic mean of precision and recall, provides a balanced measure that accounts for both types of error, ensuring the model optimizes performance in a way that minimizes the impact of misclassifications.
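The F1 computation from confusion-matrix counts is straightforward; the example counts below are hypothetical:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 60 true positives, 10 false positives, 20 false negatives
print(round(f1_score(60, 10, 20), 3))  # 0.8
```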
With the evaluation metric established, we proceed to the core aspects of the training process, which focus on optimizing the model’s performance. Several key data-driven questions guide this process:
- Baseline Determination: How does the model perform with default hyperparameter settings?
- Tokenizer Selection: Which tokenizer should be employed to preprocess the textual data effectively, accounting for nuances such as tokenization of hashtags, mentions, and special characters?
- Data Cleaning: What level of data preprocessing (e.g., removal of stopwords, punctuation, or special characters) is optimal for this task to ensure high-quality input data?
- Embedding Layers: What embedding dimension best balances the model’s ability to capture semantic information without increasing computational complexity unnecessarily?
To answer these questions, different models will be produced for each question, building on the results of the previous one. At each comparison step multiple embedding dimensions will be modeled to see whether this affects the outcome.
The goal is to fine-tune these hyperparameters and preprocessing steps to strike a balance between model complexity and predictive accuracy, ensuring the best possible performance on unseen data. Following this optimization process, the top three models—based on their performance in training and validation—will be selected for final evaluation and submission to Kaggle.
Code
import duckdb
con = duckdb.connect()
query = "SELECT * FROM read_parquet('../train_stats_f1/*.parquet', union_by_name=True)"
df = con.execute(query).fetchdf()
df['Positive Ratio'] = (df['Validation True Positive'] + df['Validation False Positive']) / (df['Validation True Positive'] + df['Validation True Negative'] + df['Validation False Positive'] + df['Validation False Negative'])
df["Epoch"] = df["Epoch"].astype(int)
Code
def plot_vs_embedding_dim(df, y_col, hue = "Embedding Dimensions", show_legend = True, ylim=None):
plt.figure(figsize=(5.5, 3))
sns.lineplot(df, x = 'Epoch', y=y_col, hue = hue, palette="deep", legend=show_legend)
# Customize legend
if show_legend is True:
plt.legend(
title="\n".join(hue.split(" ")),
loc='center left', # Adjust to the right of the plot
bbox_to_anchor=(1, 0.5), # Position to the right
frameon=False # Remove background and border
)
if ylim is not None:
plt.ylim((0, ylim))
plt.tight_layout()
    plt.show()
5.1 Baseline Models
5.1.1 Training Loss and F1 Score
Code
5.1.2 Learning Rate and Compute Time
Code
Figure 9 (a) illustrates that across embedding dimensions, the training F1 score consistently improves with an increasing number of epochs. Models with lower embedding dimensions, however, converge to a lower final training F1 score, suggesting that these dimensions may limit the model's capacity. Similarly, in Figure 9 (b), validation F1 scores also increase with training epochs. However, an upper limit is observed, where embedding dimensions above 16 do not demonstrate a significant difference in validation performance.
Based on these trends, training for 50 epochs appears sufficient for models with embedding dimensions greater than 8 to achieve training stability and reach their performance plateau.
Figure 10 (a) visualizes the learning rate progression over training epochs. The learning rate is adjusted dynamically using a ReduceLROnPlateau scheduler based on validation F1 scores, as shown in the code snippet below:
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
optimizer,
mode="max",
)
# ...
scheduler.step(val_f1)
Models with higher embedding dimensions tend to trigger the scheduler earlier, indicating that these models converge more rapidly. While this is advantageous for reducing the risk of overfitting, it does not necessarily translate into improved validation performance beyond the observed upper limit.
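The scheduler's behaviour can be mimicked conceptually in plain Python. PyTorch's defaults of factor=0.1 and patience=10 are assumed; this is a sketch of the plateau idea, not the library implementation (which also applies an improvement threshold).

```python
def simulate_plateau(metrics, lr=0.001, factor=0.1, patience=10):
    """Reduce lr by `factor` when the metric stops improving for `patience` epochs."""
    best, bad_epochs = float("-inf"), 0
    history = []
    for m in metrics:
        if m > best:
            best, bad_epochs = m, 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                lr *= factor
                bad_epochs = 0
        history.append(lr)
    return history

# Validation F1 improves for three epochs, then plateaus for twelve:
# the learning rate drops once the patience window is exhausted.
lrs = simulate_plateau([0.5, 0.6, 0.7] + [0.7] * 12)
print(lrs[-1])
```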
Figure 10 (b) examines the compute times per epoch. Models with embedding dimensions below 256 maintain consistent training times of approximately 2 seconds per epoch. However, as embedding dimensions increase, so do computation times. Given that larger models fail to deliver meaningful performance improvements, it is computationally efficient to select a smaller model that balances performance and training cost effectively.
5.2 Tokenizer Comparison
Code
Code
Figure 11 and Figure 12 show no significant differences in performance between the cased and uncased tokenizers, as measured by training and validation F1 scores. However, since the original dataset retains case sensitivity, preserving this feature through a cased tokenizer aligns with the original characteristics of the data. This decision ensures that potentially meaningful information encoded in capitalization is retained.
5.3 Data Level Comparison
Code
Code
Figure 13 and Figure 14 indicate no measurable difference in F1 scores across various data preprocessing levels. Given that altering the data level modifies how the tokenizer processes the input, it introduces additional complexity without yielding performance benefits. Based on this finding we will use the original unaltered data as input to subsequent models.
5.4 1-3 Embedding Layers
Code
Code
Code
5.5 4-6 Embedding Layers
Code
df_data_one_layer = df.loc[
(df['Comparison Type'] == 'Big Layers') & (df['Embedding Layers'] == 4)
]
df_data_two_layer = df.loc[
(df['Comparison Type'] == 'Big Layers') & (df['Embedding Layers'] == 5)
]
df_data_three_layer = df.loc[
(df['Comparison Type'] == 'Big Layers') & (df['Embedding Layers'] == 6)
]
Code
Code
Increasing the embedding dimensions has a subtle but observable effect on the validation F1 score. Models with higher embedding dimensions appear to benefit from a regularizing effect that helps mitigate overfitting, contributing to greater training stability, as evidenced by reduced fluctuations in F1 scores across epochs. While the performance gains are marginal, the enhanced stability offered by larger embedding dimensions may provide a slight advantage for tasks requiring consistent and reliable predictions.
5.6 Final Models
Code
In the final output models, we observe a steady increase in the training F1 score across all models, while the validation F1 score stabilizes after ~30 epochs. Notably, models with 2 and 6 embedding layers achieve higher validation F1 scores compared to those with 4 embedding layers. This suggests that embedding layer depth plays a nuanced role in model performance, with certain configurations better capturing the underlying patterns in the data. All models demonstrate the ability to fit the input data effectively and achieve stability within the specified 50 epochs.
6 Results
6.1 Kaggle Screenshot
6.2 Results Analysis
Code
# Reshape the data for a grouped bar plot
df_melted = df_final.melt(
id_vars="Embedding Layers",
value_vars=["Public Score"],
var_name="Score Type",
value_name="Score",
)
# Create the grouped bar plot
plt.figure(figsize=(10, 3))
ax = sns.barplot(data=df_melted, x="Embedding Layers", y="Score", hue="Score Type")
for container in ax.containers:
ax.bar_label(container, fmt="%.3f")
plt.ylim((0, 0.85))
plt.title("Scores by Embedding Layers")
plt.show()

| Embedding Layers | Score Type | Score |
|---|---|---|
| 2 | Public Score | 0.75973 |
| 4 | Public Score | 0.73490 |
| 6 | Public Score | 0.73337 |
In Figure 21 and Table 9, the final Kaggle scores for each model are visualized and tabulated. These scores represent performance on the public leaderboard, providing a benchmark for comparison. The model with 2 embedding layers achieved the highest public score, suggesting that simpler architectures may generalize better for this task. This outcome aligns with the validation F1 scores, where models with fewer embedding layers performed comparably to or better than more complex configurations. Apart from the number of embedding layers, the final models share the specifications below.
6.2.1 Final Model Specifications
from IPython.display import display, Markdown
df_final['Start Learning Rate'] = 0.001
df_final['End Learning Rate'] = df['Learning Rate']
df_specs = (
df_final[
[
"Learning Optimization",
"Start Learning Rate",
"End Learning Rate",
"Specified Epochs",
"Batch Size",
"Data Level",
"Vocab Size",
"Tokenizer",
"Embedding Dimensions",
"Hidden Dimensions",
]
]
.iloc[0]
.T
)
# Convert series into df
df_specs = df_specs.reset_index()
df_specs.columns = ["Specification", "Value"]
display(Markdown(df_specs.to_markdown(index=False)))

| Specification | Value |
|---|---|
|---|---|
| Learning Optimization | default |
| Start Learning Rate | 0.001 |
| End Learning Rate | 1e-06 |
| Specified Epochs | 50 |
| Batch Size | 256 |
| Data Level | original |
| Vocab Size | 28996 |
| Tokenizer | bert-base-cased |
| Embedding Dimensions | 128 |
| Hidden Dimensions | 256 |
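The gap between the start learning rate (0.001) and the end learning rate (1e-06) is consistent with a scheduler that cuts the rate tenfold three times during training; the raw value `1.0000000000000002e-06` is ordinary floating-point rounding noise. A minimal sketch under that assumption (not the project's exact scheduler):

```python
# Three tenfold reductions take the start rate to the end rate in the table.
lr = 0.001
reductions = 0
while lr > 1e-06 * 1.5:  # stop once we reach the neighborhood of 1e-06
    lr *= 0.1
    reductions += 1
print(reductions, lr)  # 3 reductions; lr is ~1e-06 plus float noise
```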
6.2.2 Additional Final Model Statistics
# Reshape the data for a grouped bar plot
df_melted = df_final.melt(
id_vars="Embedding Layers",
value_vars=["Validation Accuracy Score", "Validation Precision Score", "Validation Recall Score"],
var_name="Score Type",
value_name="Score",
)
# Create the grouped bar plot
plt.figure(figsize=(10, 3))
ax = sns.barplot(data=df_melted, x="Embedding Layers", y="Score", hue="Score Type")
for container in ax.containers:
ax.bar_label(container, fmt="%.3f")
plt.ylim((0, 0.85))
plt.title("Scores by Embedding Layers")
plt.show()

| Embedding Layers | Score Type | Score |
|---|---|---|
| 2 | Validation Accuracy Score | 0.768221 |
| 4 | Validation Accuracy Score | 0.741300 |
| 6 | Validation Accuracy Score | 0.749836 |
| 2 | Validation Precision Score | 0.781197 |
| 4 | Validation Precision Score | 0.737624 |
| 6 | Validation Precision Score | 0.730475 |
| 2 | Validation Recall Score | 0.670088 |
| 4 | Validation Recall Score | 0.655425 |
| 6 | Validation Recall Score | 0.699413 |
In addition to the public Kaggle scores, the models were validated using accuracy, precision, recall, and F1 score, shown in Figure 21 and Table 11. These metrics provide a comprehensive view of model performance and highlight differences in how the models handle the classification task.
- Accuracy:
- The model with 2 embedding layers achieved the highest validation accuracy (0.768221), outperforming both the 4-layer and 6-layer models.
- While the accuracy decreases slightly with an increase in embedding layers, the difference is small.
- Precision:
- Precision is highest for the 2-layer model (0.781197), indicating its strength in minimizing false positives.
- As the number of embedding layers increases, precision declines, with the 6-layer model scoring the lowest (0.730475).
- Recall:
- Recall improves with more embedding layers, with the 6-layer model achieving the highest score (0.699413). This suggests that models with more embedding layers are better at capturing true positives, albeit at the expense of increased false positives.
7 Conclusion
7.1 Project Summary
This project explored the classification of disaster-related tweets using Recurrent Neural Networks (RNNs) built with PyTorch. Various configurations, including embedding dimensions, tokenizer choices, and data cleaning levels, were systematically evaluated to tune hyperparameters. The results demonstrated that a 2-layer embedding model achieved the highest overall performance, balancing validation accuracy (0.768221), precision (0.781197), recall (0.670088), and F1 score (0.721389), and achieving a public Kaggle evaluation score of 0.75973. The findings underscore the value of simplicity in model architecture, with more complex configurations yielding diminishing returns.
7.2 Lessons Learned
- Model Architecture and Complexity: Increasing embedding layers and introducing dropout led to marginal stability improvements but did not significantly enhance validation F1 scores. Simpler architectures performed comparably or better in most cases.
- Tokenizer and Data Cleaning: The cased tokenizer demonstrated equivalent performance to the uncased version, justifying the retention of the original data's case sensitivity. Altering data levels disrupted tokenizer behavior without providing measurable benefits.
- Layer Size: Models with fewer embedding layers converged faster and triggered the learning rate scheduler earlier, suggesting efficiency advantages in training. All models reached stability within 50 epochs, highlighting the importance of early stopping to reduce computational overhead.
- Evaluation Metrics: The F1 score proved to be the most informative metric for this dataset, balancing precision and recall effectively. Relying solely on accuracy or public Kaggle scores would have overlooked critical trade-offs in model performance.
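The early-stopping idea noted above can be sketched as a simple patience counter (a generic illustration, not the project's training loop): stop once the validation score has gone a fixed number of epochs without improving.

```python
def early_stop_epoch(val_scores: list[float], patience: int = 5) -> int:
    # Return the epoch at which training would stop: `patience` epochs
    # after the best validation score, or the last epoch if never triggered.
    best, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_scores) - 1

# A run that improves for a few epochs and then plateaus:
scores = [0.60, 0.65, 0.70, 0.72, 0.72, 0.71, 0.72, 0.71, 0.70, 0.71]
print(early_stop_epoch(scores, patience=5))  # 8: five epochs past the best (epoch 3)
```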
7.3 Areas for Improvement / Future Work
- Feature Engineering: Incorporating additional features, such as sentiment analysis scores or tweet metadata, could enhance the model's ability to capture nuanced patterns in the data.
- Architectural Changes: Future experiments could include transformer-based architectures, such as BERT or GPT, to assess whether advanced models outperform RNNs for this task.
- Generalization Analysis: While this project focused on disaster-related tweets, extending the dataset to include non-disaster events could help test the model's generalization capabilities across broader text classification domains.